{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 更多关于多元高斯分布的知识\n",
    "\n",
    "截至目前，我们已经在不少的应用中见到了多元高斯分布的身影，比如线性回归的概率解释、高斯判别分析、高斯混合模型聚类、以及因子分析模型。在这篇笔记中，我们将揭示关于多元高斯分布的（已经在因子分析模型中引入的）更多优良特性，并给我们这些特性的直观概念。\n",
    "\n",
    "## 1. 定义\n",
    "\n",
    "一个向量形式的随机变量$x\\in\\mathbb R^n$，以$\\mu\\in\\mathbb R^n$为期望、$\\varSigma\\in\\mathbb S_{++}^n$为协方差矩阵（在[线性代数](sn01.ipynb)笔记中$\\mathbb S_{++}^n$为$n\\times n$正定对称矩阵空间，具体定义为$\\mathbb S_{++}^n=\\left\\{A\\in\\mathbb R^{n\\times n}: A=A^T,\\ \\forall x\\in\\mathbb R^n\\land x\\neq0\\to x^TAx\\gt0\\right\\}$），如果随机变量的概率密度函数满足：\n",
    "\n",
    "$$\n",
    "p\\left(x;\\mu,\\varSigma\\right)=\\frac{1}{\\left(2\\pi\\right)^{n/2}\\left\\lvert\\varSigma\\right\\rvert^{1/2}}\\exp\\left(-\\frac{1}{2}\\left(x-\\mu\\right)^T\\varSigma^{-1}\\left(x-\\mu\\right)\\right)\n",
    "$$\n",
    "\n",
    "则称该随机变量服从**多元正太（高斯）分布（multivariate normal (or Gaussian) distribution）**，写作$x\\sim\\mathcal N(\\mu,\\varSigma)$。\n",
    "\n",
    "## 2. 相关特性\n",
    "\n",
    "多元高斯分布在各种场景中都很好用通常是因为以下几个特性：\n",
    "\n",
    "1. 如果我们知道随机变量$x$的期望$\\mu$和协方差$\\varSigma$，就可以直接得到随机变量$x$的概率密度函数；\n",
    "2. 以下关于多元高斯分布的积分均有明确的解析解：\n",
    "\n",
    "    $$\n",
    "    \\begin{align}\n",
    "    \\int_{x\\in\\mathbb R^n}p\\left(x;\\mu,\\varSigma\\right)\\mathrm dx &= \\int_{-\\infty}^\\infty\\cdots\\int_{-\\infty}^\\infty p\\left(x;\\mu,\\varSigma\\right)\\mathrm dx_1\\cdots\\mathrm dx_n=1\\\\\n",
    "    \\int_{x\\in\\mathbb R^n}x_ip\\left(x;\\mu,\\sigma^2\\right)\\mathrm dx &= \\mu_i\\\\\n",
    "    \\int_{x\\in\\mathbb R^n}\\left(x_i-\\mu_i\\right)\\left(x_j-\\mu_j\\right)p\\left(x;\\mu,\\sigma^2\\right)\\mathrm dx &= \\varSigma_{ij}\n",
    "    \\end{align}\n",
    "    $$\n",
    "\n",
    "3. 高斯分布具有下列闭包性：\n",
    "    * 相互独立的高斯随机变量之和仍是高斯随机变量；\n",
    "    * 联合高斯分布的边缘分布仍是高斯分布；\n",
    "    * 联合高斯分布的条件分布仍是高斯分布；\n",
    "\n",
    "这几个特性，尤其是前两个，都非常直观（或至少看上去正确）。可能唯一不是很明确的是，为什么这几个特性会如此强大，在这篇笔记中，我们将给出关于这几个特性的直观概念，并介绍在平常遇到的问题中如何使用多元高斯分布处理随机变量。\n",
    "\n",
    "## 3. 闭包规则\n",
    "\n",
    "在这一小节，我们将介绍前面提到的闭包性，并会使用特性1、2证明这些闭包规则，或给出这些规则为什么成立的直观解释。\n",
    "\n",
    "后面会介绍的内容有：\n",
    "\n",
    "$$\n",
    "\\begin{array}\n",
    "{c|c c c}\n",
    "&\\textrm{求和}&\\textrm{边缘分布}&\\textrm{条件分布}\\\\\n",
    "\\hline\n",
    "\\textrm{为什么是高斯分布？}&\\textrm{不介绍}&\\textrm{介绍}&\\textrm{介绍}\\\\\n",
    "\\textrm{相应的概率密度函数}&\\textrm{介绍}&\\textrm{介绍}&\\textrm{介绍}\n",
    "\\end{array}\n",
    "$$\n",
    "\n",
    "### 3.1 相互独立的高斯随机变量之和仍是高斯随机变量\n",
    "\n",
    "这条规则的正式描述应为：\n",
    "\n",
    "* 设$y\\sim\\mathcal N(\\mu,\\varSigma),\\ z\\sim\\mathcal N(\\mu',\\varSigma')$是服从高斯发布的相互独立的随机变量，其中$\\mu,\\mu'\\in\\mathbb R^n,\\ \\varSigma,\\varSigma'\\in\\mathbb S_{++}^n$，则它们之和是高斯分布：\n",
    "\n",
    "    $$\n",
    "    y+z\\sim\\mathcal N(\\mu+\\mu',\\varSigma+\\varSigma')\n",
    "    $$\n",
    "\n",
    "在给出证明之前，先观察这个式子：\n",
    "\n",
    "1. 首先要指出，独立性假设在这个命题中是非常重要的。设有随机变量$y\\sim\\mathcal N(\\mu,\\varSigma)$，其期望为$\\mu$、协方差矩阵为$\\varSigma$，再假设$z=-y$（不独立），则显然$z$也服从高斯分布$z\\sim\\mathcal N(-\\mu,\\varSigma)$，但此时$y+z$恒为零。\n",
    "2. 然后是澄清一个疑问：如果我们将两个高斯模型（类似一个高维空间的“鼓包”）相加，难道不是得到了某种“双峰模型”（两个“鼓包”）吗？这里我们要理解，两个随机变量相加$y+z$的规则并不是将每个独立的随机变量简单的概率密度简单相加，随机变量$y+z$的概率密度是将$y$和$z$的概率密度做[卷积](https://zh.wikipedia.org/wiki/%E5%8D%B7%E7%A7%AF)（[Convolution](https://en.wikipedia.org/wiki/Convolution)）。\n",
    "\n",
    "    证明两个高斯概率密度之和仍为高斯概率密度超出了本课程范围，这里仅给出卷积的例子，设$y,z$为单变量高斯分布$y\\sim\\mathcal N(\\mu,\\varSigma),\\ z\\sim\\mathcal N\\left(\\mu',\\varSigma'\\right)$，则它们概率密度的卷积为：\n",
    "    $$\n",
    "    \\begin{align}\n",
    "    p\\left(y+z;\\mu,\\mu',\\varSigma^2,\\varSigma'^2\\right)&=\\int_{-\\infty}^\\infty p\\left(w;\\mu,\\sigma^2\\right)p\\left(y+z-w;\\mu',\\sigma'^2\\right)\\mathrm dw\\\\\n",
    "    &=\\int_{-\\infty}^\\infty\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp\\left(-\\frac{1}{2\\sigma^2}(w-\\mu)^2\\right)\\cdot\\frac{1}{\\sqrt{2\\pi}\\sigma'}\\exp\\left(-\\frac{1}{2\\sigma'^2}\\left(y+z-w-\\mu'\\right)^2\\right)\\mathrm dw\n",
    "    \\end{align}\n",
    "    $$\n",
    "\n",
    "这里我们由观测得到卷积运算给出了某种高斯分布，那么根据特性1，通过卷积运算能够得到$p(y+z;\\mu,\\varSigma)$。不过特性1也指出，如果仅求出期望向量与协方差矩阵，也可以得到这个高斯分布，对我们来说明显这样更简单：\n",
    "\n",
    "根据期望的线性性质有：\n",
    "\n",
    "$$\n",
    "\\mathrm E\\left[y_i+z_i\\right]=\\mathrm E\\left[y_i\\right]+\\mathrm E\\left[z_i\\right]=\\mu_i+\\mu_i'\n",
    "$$\n",
    "\n",
    "所以$y+z$的期望就是$\\mu+\\mu$。同样的，协方差矩阵的第$(i,j)$元素为：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\mathrm E&\\left[(y_i+z_i)\\left(y_j+z_j\\right)\\right]-\\mathrm E\\left[y_i+z_i\\right]\\mathrm E\\left[y_j+z_j\\right]\\\\\n",
    "&=\\mathrm E\\left[y_iy_j+z_iy_j+y_iz_j+z_iz_j\\right]-\\left(\\mathrm E\\left[y_i\\right]+\\mathrm E\\left[z_i\\right]\\right)\\left(\\mathrm E\\left[y_j\\right]+\\mathrm E\\left[z_j\\right]\\right)\\\\\n",
    "&=\\mathrm E\\left[y_iy_j\\right]+\\mathrm E\\left[z_iy_j\\right]+\\mathrm E\\left[y_iz_j\\right]+\\mathrm E\\left[z_iz_j\\right]-\\mathrm E\\left[y_i\\right]\\mathrm E\\left[y_j\\right]-\\mathrm E\\left[z_i\\right]\\mathrm E\\left[y_j\\right]-\\mathrm E\\left[y_i\\right]\\mathrm E\\left[z_j\\right]-\\mathrm E\\left[z_i\\right]\\mathrm E\\left[z_j\\right]\\\\\n",
    "&=\\left(\\mathrm E\\left[y_iy_j\\right]-\\mathrm E\\left[y_i\\right]\\mathrm E\\left[y_j\\right]\\right)+\\left(\\mathrm E\\left[z_iz_j\\right]-\\mathrm E\\left[z_i\\right]\\mathrm E\\left[z_j\\right]\\right)+\\left(\\mathrm E\\left[z_iy_j\\right]-\\mathrm E\\left[z_i\\right]\\mathrm E\\left[y_j\\right]\\right)+\\left(\\mathrm E\\left[y_iz_j\\right]-\\mathrm E\\left[y_i\\right]\\mathrm E\\left[z_j\\right]\\right)\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "因为$y,z$相互独立，则有$\\mathrm E\\left[z_iy_j\\right]=\\mathrm E\\left[z_i\\right]\\mathrm E\\left[y_j\\right]$、$\\mathrm E\\left[y_iz_j\\right]=\\mathrm E\\left[y_i\\right]\\mathrm E\\left[z_j\\right]$，于是最后两项为零，剩下的项为：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\mathrm E&\\left[(y_i+z_i)\\left(y_j+z_j\\right)\\right]-\\mathrm E\\left[y_i+z_i\\right]\\mathrm E\\left[y_j+z_j\\right]\\\\\n",
    "&=\\left(\\mathrm E\\left[y_iy_j\\right]-\\mathrm E\\left[y_i\\right]\\mathrm E\\left[y_j\\right]\\right)+\\left(\\mathrm E\\left[z_iz_j\\right]-\\mathrm E\\left[z_i\\right]\\mathrm E\\left[z_j\\right]\\right)\\\\\n",
    "&=\\varSigma_{ij}+\\varSigma_{ij}'\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "所以得到了$y+z$的协方差矩阵为$\\varSigma+\\varSigma'$。\n",
    "\n",
    "现在回忆一下刚才的计算过程，我们使用了期望的性质和随机变量的独立假设，计算出随机变量求和后$y+z$的期望和协方差矩阵。那么根据特性1，我们就可以在不进行卷积运算的情况下求出$y+z$的概率密度了。（当然，我们提前“观察”得到了求和后的分布仍是一个高斯分布。）\n",
    "\n",
    "### 3.2 联合高斯分布的边缘分布仍是高斯分布\n",
    "\n",
    "这条规则的正是描述应为：\n",
    "\n",
    "设有$\n",
    "\\begin{bmatrix}\n",
    "x_A\\\\x_B\n",
    "\\end{bmatrix}\n",
    "\\sim\n",
    "\\mathcal N\n",
    "\\left(\n",
    "\\begin{bmatrix}\n",
    "\\mu_A\\\\\\mu_B\n",
    "\\end{bmatrix}\n",
    ",\n",
    "\\begin{bmatrix}\n",
    "\\varSigma_{AA}&\\varSigma_{AB}\\\\\n",
    "\\varSigma_{BA}&\\varSigma_{BB}\n",
    "\\end{bmatrix}\n",
    "\\right)\n",
    "$，其中的随机变量为$x_A\\in\\mathbb R^m,x_B\\in\\mathbb R^n$，式中的期望向量及协方差矩阵都是满足这两个随机变量维度要求的。则边缘分布：\n",
    "$$\n",
    "\\begin{align}\n",
    "p\\left(x_A\\right)&=\\int_{x_B\\in\\mathbb R^n}p\\left(x_A,x_B;\\mu,\\varSigma\\right)\\mathrm dx_B\\\\\n",
    "p\\left(x_B\\right)&=\\int_{x_A\\in\\mathbb R^m}p\\left(x_A,x_B;\\mu,\\varSigma\\right)\\mathrm dx_A\n",
    "\\end{align}\n",
    "$$\n",
    "均为高斯分布：\n",
    "$$\n",
    "\\begin{align}\n",
    "x_A&\\sim\\mathcal N\\left(\\mu_A,\\varSigma_{AA}\\right)\\\\\n",
    "x_B&\\sim\\mathcal N\\left(\\mu_B,\\varSigma_{BB}\\right)\\\\\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "下面我们来证明这个规则，关注随机变量$x_A$的边缘分布（对于服从高斯分布的向量型随机变量$x$，我们总是可以转置随机变量$x$，就像我们转置其期望向量和协方差矩阵的行列一样。所以，只要证明规则在随机变量$x_A$上成立，就可以将结论推广到随机变量$x_B$）：\n",
    "\n",
    "容易看出，计算边缘分布的期望和协方差矩阵是比较容易的：只需要取联合分布的期望、协方差矩阵的相应分块矩阵即可——为了证明这个结论，我们来看$x_{A,i},x_{A,j}$这两个元素（及随机变量$x_A$的第$i$个和第$j$个元素）。同时，$x_{A,i},x_{A,j}$也是$\\begin{bmatrix}x_A\\\\x_B\\end{bmatrix}$的第$i,j$个元素（因为$x_A$位于这个向量的上半部分）。要找到对应的协方差，我们可以观察协方差矩阵$\\begin{bmatrix}\\varSigma_{AA}&\\varSigma_{AB}\\\\\\varSigma_{BA}&\\varSigma_{BB}\\end{bmatrix}$的第$(i,j)$元素。这个元素位于协方差分块矩阵$\\varSigma_{AA}$中，准确的说就是$\\varSigma_{AA,ij}$。依照这个论点，接下来只需要查看所有的$i,j\\in\\{1,\\cdots,m\\}$就可以发现，$x_A$的协方差矩阵就是$\\varSigma_{AA}$。类似的论点可以用于计算$x_A$的均值，就是$\\mu_A$。因此，上面这论点告诉我们，如果我们知道了关于$x_A$的边缘分布是一个高斯分布，则我们可以立刻通过联合分布的期望向量和协方差矩阵的恰当的分块矩阵写出$x_A$的密度函数。\n",
    "\n",
    "不过，上面的这个论点并不那么令人满意，因为我们并不能直接看出$x_A$一定服从多元高斯分布。这个论点的证明比较冗长，总体分为四个步骤：\n",
    "\n",
    "1. 严格的用积分形式表达出边缘分布；\n",
    "2. 将协方差矩阵的逆写成分块矩阵的形式；\n",
    "3. 使用配方法（completion of squares）计算关于$x_B$的积分；\n",
    "4. 证明得到的结果服从高斯分布。\n",
    "\n",
    "现在来看如何实现这四步：\n",
    "\n",
    "#### 3.2.1 边缘分布概率密度的积分形式\n",
    "\n",
    "假设我们想要直接计算$x_A$的概率密度函数，则需要计算下面这个积分：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "p\\left(x_A\\right)&=\\int_{x_B\\in\\mathbb R^n}p\\left(x_A,x_B;\\mu,\\varSigma\\right)\\mathrm dx_B\\\\\n",
    "&=\\frac{1}{(2\\pi)^{\\frac{m+n}{2}}\\ \\begin{vmatrix}\\varSigma_{AA}&\\varSigma_{AB}\\\\\\varSigma_{BA}&\\varSigma_{BB}\\end{vmatrix}^{1/2}}\n",
    "\\int_{x_B\\in\\mathbb R^n}\\exp\\left(-\\frac{1}{2}\\begin{bmatrix}x_A-\\mu_A\\\\x_B-\\mu_B\\end{bmatrix}^T\\begin{bmatrix}\\varSigma_{AA}&\\varSigma_{AB}\\\\\\varSigma_{BA}&\\varSigma_{BB}\\end{bmatrix}^{-1}\\begin{bmatrix}x_A-\\mu_A\\\\x_B-\\mu_B\\end{bmatrix}\\right)\\mathrm dx_B\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "#### 3.2.2 协方差矩阵分块\n",
    "\n",
    "我们还需要将指数部分的矩阵乘法写成另一种形式。定义矩阵$V\\in\\mathbb R^{(m+n)\\times(m+n)}$为$V=\\begin{bmatrix}V_{AA}&V_{AB}\\\\V_{BA}&V_{BB}\\end{bmatrix}=\\varSigma^{-1}$。\n",
    "\n",
    "可以将这个矩阵视为$V=\\begin{bmatrix}V_{AA}&V_{AB}\\\\V_{BA}&V_{BB}\\end{bmatrix}=\\begin{bmatrix}\\varSigma_{AA}&\\varSigma_{AB}\\\\\\varSigma_{BA}&\\varSigma_{BB}\\end{bmatrix}^{-1}\"=\"\\begin{bmatrix}\\varSigma_{AA}^{-1}&\\varSigma_{AB}^{-1}\\\\\\varSigma_{BA}^{-1}&\\varSigma_{BB}^{-1}\\end{bmatrix}$。要注意最右边（带引号的等号）并不成立，后面的步骤会回来继续讨论这个问题。现在只看$V$的定义就像了，不需要去在意分块矩阵究竟是什么。\n",
    "\n",
    "根据$V$的定义，积分可以写作：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "p\\left(x_A\\right)=\\frac{1}{Z}\\int_{x_B\\in\\mathbb R^n}\\exp\\Bigg(-\\bigg[\n",
    "&\\frac{1}{2}\\left(x_A-\\mu_A\\right)^TV_{AA}\\left(x_A-\\mu_A\\right)+\\frac{1}{2}\\left(x_A-\\mu_A\\right)^TV_{AB}\\left(x_B-\\mu_B\\right)\\\\\n",
    "&+\\frac{1}{2}\\left(x_B-\\mu_B\\right)^TV_{BA}\\left(x_A-\\mu_A\\right)+\\frac{1}{2}\\left(x_B-\\mu_B\\right)^TV_{BB}\\left(x_B-\\mu_B\\right)\n",
    "\\bigg]\\Bigg)\\mathrm dx_B\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "其中$Z$是某个不依靠$x_A,x_B$的常数，我们可以先暂时忽略这个值。如果以前没有做过分块矩阵乘法，可以用一个$2\\times2$矩阵自己试一下，最终会得到：\n",
    "\n",
    "$$\n",
    "x^TAx=\\sum_i\\sum_jA_{ij}x_ix_j=x_1A_{11}x_1+x_1A_{12}x_2+x_2A_{21}x_1+x_2A_{22}x_2\n",
    "$$\n",
    "\n",
    "#### 3.2.3 使用配方法计算关于$x_B$的积分\n",
    "\n",
    "为了计算这个积分，我们需要先求出$x_B$的积分。不过，高斯分布的积分通常很难直接求解，我们可以使用特性2求出某些特定高斯分布的积分。这一小节的主要思路，就是将上一小节得到的积分通过某种“变形”写为特性2中的某个积分，以化简求解。\n",
    "\n",
    "这一步的关键就是配方法。考虑计算二次函数$z^TAz+b^z+c$，其中$A$是一个非奇异对称矩阵，则可以证明：\n",
    "\n",
    "$$\n",
    "\\frac{1}{2}z^TAz+b^Tz+c=\\frac{1}{2}\\left(z+A^{-1}b\\right)^TA\\left(z+A^{-1}b\\right)+c-\\frac{1}{2}b^TA^{-1}b\n",
    "$$\n",
    "\n",
    "这是多变量泛化的配方法，在单变量代数中为：\n",
    "\n",
    "$$\n",
    "\\frac{1}{2}az^2+bz+c=\\frac{1}{2}a\\left(z+\\frac{b}{a}\\right)^2+c-\\frac{b^2}{2a}\n",
    "$$\n",
    "\n",
    "为了使用配方法，我们令：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "z&=x_B-\\mu_B\\\\\n",
    "A&=V_{BB}\\\\\n",
    "b&=V_{BA}\\left(x_A-\\mu_A\\right)\\\\\n",
    "c&=\\frac{1}{2}\\left(x_A-\\mu_A\\right)^TV_{AA}\\left(x_A-\\mu_A\\right)\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "则通过多变量配方法公式可得：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "p(x_A)=\\frac{1}{Z}\\int_{x_B\\in\\mathbb R^n}\\exp\\Bigg(-\\bigg[\n",
    "&\\frac{1}{2}\\left(x_B-\\mu_B+V_{BB}^{-1}V_{BA}\\left(x_A-\\mu_A\\right)\\right)^TV_{BB}\\left(x_B-\\mu_B+V_{BB}^{-1}V_{BA}\\left(x_A-\\mu_A\\right)\\right)\\\\\n",
    "&+\\frac{1}{2}\\left(x_A-\\mu_A\\right)^TV_{AA}\\left(x_A-\\mu_A\\right)-\\frac{1}{2}\\left(\\left(x_A-\\mu_A\\right)^TV_{AB}V_{BB}^{-1}V_{BA}\\left(x_A-\\mu_A\\right)\\right)\n",
    "\\bigg]\\Bigg)\\mathrm dx_B\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "我们可以将不含$x_B$的项提出来：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "p(x_A)=\\exp&\\left(-\\frac{1}{2}\\left(x_A-\\mu_A\\right)^TV_{AA}\\left(x_A-\\mu_A\\right)+\\frac{1}{2}\\left(\\left(x_A-\\mu_A\\right)^TV_{AB}V_{BB}^{-1}V_{BA}\\left(x_A-\\mu_A\\right)\\right)\\right)\\\\\n",
    "&\\cdot\\frac{1}{Z}\\int_{x_B\\in\\mathbb R^n}\\exp\\left(\\frac{1}{2}\\left[\\left(x_B-\\mu_B+V_{BB}^{-1}V_{BA}\\left(x_A-\\mu_A\\right)\\right)^TV_{BB}\\left(x_B-\\mu_B+V_{BB}^{-1}V_{BA}\\left(x_A-\\mu_A\\right)\\right)\\right]\\right)\\mathrm dx_B\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "现在，就可以使用特性2中的公式了，我们知道对于期望为$x$和协方差矩阵为$\\varSigma$的多元高斯分布的随机变量$x$，其概率密度函数标准化后为：\n",
    "\n",
    "$$\n",
    "\\frac{1}{\\left(2\\pi\\right)^{n/2}\\left\\lvert\\varSigma\\right\\rvert^{1/2}}\\int_{\\mathbb R^n}\\exp\\left(-\\frac{1}{2}\\left(x-\\mu\\right)^T\\varSigma^{-1}\\left(x-\\mu\\right)\\right)=1\n",
    "$$\n",
    "\n",
    "等价的有：\n",
    "\n",
    "$$\n",
    "\\int_{\\mathbb R^n}\\exp\\left(-\\frac{1}{2}\\left(x-\\mu\\right)^T\\varSigma^{-1}\\left(x-\\mu\\right)\\right)=\\left(2\\pi\\right)^{n/2}\\left\\lvert\\varSigma\\right\\rvert^{1/2}\n",
    "$$\n",
    "\n",
    "于是最终得到（改变期望或方差并不会改变概率密度函数下方的面积，全概率始终应是$1$）：\n",
    "\n",
    "$$\n",
    "p(x_A)=\\frac{1}{Z}\\cdot\\left(2\\pi\\right)^{n/2}\\left\\lvert\\varSigma\\right\\rvert^{1/2}\\cdot\\exp\\left(\\frac{1}{2}\\left(\\left(x_A-\\mu_A\\right)^T\\left(V_{AA}-V_{AB}V_{BB}^{-1}V_{BA}\\right)\\left(x_A-\\mu_A\\right)\\right)\\right)\n",
    "$$\n",
    "\n",
    "#### 3.2.4 证明得到的结果是一个高斯分布\n",
    "\n",
    "到这一步基本就结束了，忽略前面的标准化系数，可以看出$x_A$的概率密度是一个指数上带有$x_A$二次型的函数。同时，我们很容易看出是一个以$\\mu_A$为期望、$V_{AA}-V_{AB}V_{BB}^{-1}V_{BA}$为协方差矩阵的高斯分布。尽管这个协方差矩阵有一些复杂，我们依旧可以得到——$x_A$的边缘分布仍是一个高斯分布。根据前面的推理，可以猜测到这个复杂的协方差矩阵可以通过某种途径化简为$\\varSigma_{AA}$。\n",
    "\n",
    "这个推导需要用到一个分块矩阵的公式：\n",
    "\n",
    "$$\n",
    "\\begin{bmatrix}\n",
    "A&B\\\\C&D\n",
    "\\end{bmatrix}^{-1}=\n",
    "\\begin{bmatrix}\n",
    "M^{-1}&-M^{-1}BD^{-1}\\\\\n",
    "-D^{-1}CM^{-1}&D^{-1}+D^{-1}CM^{-1}BD^{-1}\n",
    "\\end{bmatrix}\n",
    ",\\ M=A-BD^{-1}C$$\n",
    "\n",
    "这个公式可以看做是$2\\times2$矩阵求逆（利用$A^{-1}=\\frac{1}{\\det A}\\mathrm{adj}A$）的多变量泛化形式：\n",
    "\n",
    "$$\n",
    "\\begin{bmatrix}\n",
    "a&b\\\\c&d\n",
    "\\end{bmatrix}^{-1}=\n",
    "\\frac{1}{ad-bc}\n",
    "\\begin{bmatrix}\n",
    "d&-b\\\\-c&a\n",
    "\\end{bmatrix}\n",
    "$$\n",
    "\n",
    "使用这个公式，可以得到：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\begin{bmatrix}\n",
    "\\varSigma_{AA}&\\varSigma_{AB}\\\\\n",
    "\\varSigma_{BA}&\\varSigma_{BB}\n",
    "\\end{bmatrix}&=\n",
    "\\begin{bmatrix}\n",
    "V_{AA}&V_{AB}\\\\\n",
    "V_{BA}&V_{BB}\n",
    "\\end{bmatrix}^{-1}\\\\\n",
    "&=\\begin{bmatrix}\n",
    "\\left(V_{AA}-V_{AB}V_{BB}^{-1}V_{BA}\\right)^{-1}&-\\left(V_{AA}-V_{AB}V_{BB}^{-1}V_{BA}\\right)^{-1}V_{AB}V_{BB}^{-1}\\\\\n",
    "-V_{BB}^{-1}V_{BA}\\left(V_{AA}-V_{AB}V_{BB}^{-1}V_{BA}\\right)^{-1}&\\left(V_{BB}-V_{BA}V_{AA}^{-1}V_{AB}\\right)^{-1}\n",
    "\\end{bmatrix}\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "我们可以很容易的发现$\\left(V_{AA}-V_{AB}V_{BB}^{-1}V_{BA}\\right)^{-1}=\\varSigma_{AA}$，得证。\n",
    "\n",
    "### 3.3 联合高斯分布的条件分布仍是高斯分布\n",
    "\n",
    "这条规则的正是描述应为：\n",
    "\n",
    "设有$\n",
    "\\begin{bmatrix}\n",
    "x_A\\\\x_B\n",
    "\\end{bmatrix}\n",
    "\\sim\n",
    "\\mathcal N\n",
    "\\left(\n",
    "\\begin{bmatrix}\n",
    "\\mu_A\\\\\\mu_B\n",
    "\\end{bmatrix}\n",
    ",\n",
    "\\begin{bmatrix}\n",
    "\\varSigma_{AA}&\\varSigma_{AB}\\\\\n",
    "\\varSigma_{BA}&\\varSigma_{BB}\n",
    "\\end{bmatrix}\n",
    "\\right)\n",
    "$，其中的随机变量为$x_A\\in\\mathbb R^m,x_B\\in\\mathbb R^n$，式中的期望向量及协方差矩阵都是满足这两个随机变量维度要求的。则条件分布为：\n",
    "$$\n",
    "\\begin{align}\n",
    "p\\left(x_A\\mid x_B\\right)&=\\frac{p\\left(x_A,x_B;\\mu,\\varSigma\\right)}{\\int_{x_B\\in\\mathbb R^n}p\\left(x_A,x_B;\\mu,\\varSigma\\right)\\mathrm dx_B}\\\\\n",
    "p\\left(x_B\\mid x_A\\right)&=\\frac{p\\left(x_A,x_B;\\mu,\\varSigma\\right)}{\\int_{x_A\\in\\mathbb R^m}p\\left(x_A,x_B;\\mu,\\varSigma\\right)\\mathrm dx_A}\n",
    "\\end{align}\n",
    "$$\n",
    "均为高斯分布：\n",
    "$$\n",
    "\\begin{align}\n",
    "x_A\\mid x_B&\\sim\\mathcal N\\left(\\mu_A+\\varSigma_{AB}\\varSigma_{BB}^{-1}(x_B-\\mu_B),\\ \\varSigma_{AA}-\\varSigma_{AB}\\varSigma_{BB}^{-1}\\varSigma_{BA}\\right)\\\\\n",
    "x_B\\mid x_A&\\sim\\mathcal N\\left(\\mu_B+\\varSigma_{BA}\\varSigma_{AA}^{-1}(x_A-\\mu_A),\\ \\varSigma_{BB}-\\varSigma_{BA}\\varSigma_{AA}^{-1}\\varSigma_{AB}\\right)\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "与上一个证明类似，我们只需要证明$x_B\\mid x_A$成立，另一个证明也会因为对称性而成立。这个证明也分为四个步骤：\n",
    "\n",
    "1. 严格的用积分形式表达出边缘分布；\n",
    "2. 将协方差矩阵的逆写成分块矩阵的形式；\n",
    "3. 使用配方法；\n",
    "4. 证明得到的结果服从高斯分布。 \n",
    "\n",
    "具体证明如下：\n",
    "\n",
    "#### 3.3.1 条件分布概率密度的积分形式\n",
    "\n",
    "假设我们想要计算在条件$x_A$下$x_B$的概率，则应该计算：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "p\\left(x_B\\mid x_A\\right)&=\\frac{p\\left(x_A,x_B;\\mu,\\varSigma\\right)}{\\int_{x_B\\in\\mathbb R^n}p\\left(x_A,x_B;\\mu,\\varSigma\\right)\\mathrm dx_B}\\\\\n",
    "&=\\frac{1}{Z'}\\exp\\left(-\\frac{1}{2}\\begin{bmatrix}x_A-\\mu_A\\\\x_B-\\mu_B\\end{bmatrix}^T\\begin{bmatrix}\\varSigma_{AA}&\\varSigma_{AB}\\\\\\varSigma_{BA}&\\varSigma_{BB}\\end{bmatrix}^{-1}\\begin{bmatrix}x_A-\\mu_A\\\\x_B-\\mu_B\\end{bmatrix}\\right)\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "其中的$Z'$依然是标准化常量，我们用它来存放那些不依靠$x_B$的项。注意到在这次的证明中，我们甚至不需要计算积分，因为分母上的积分不依靠$x_B$，所以分母上的积分都可以放进$Z'$（视为常量）。\n",
    "\n",
    "#### 3.2.2 协方差矩阵分块\n",
    "\n",
    "与上一个证明一样，使用矩阵$V$做参数：\n",
    "\n",
    "$$\n",
    "\\begin{eqnarray}\n",
    "p\\left(x_B\\mid x_A\\right)&=\\frac{1}{Z'}\\exp\\Bigg(-\\frac{1}{2}&\\begin{bmatrix}x_A-\\mu_A\\\\x_B-\\mu_B\\end{bmatrix}^T\\begin{bmatrix}\\varSigma_{AA}&\\varSigma_{AB}\\\\\\varSigma_{BA}&\\varSigma_{BB}\\end{bmatrix}^{-1}\\begin{bmatrix}x_A-\\mu_A\\\\x_B-\\mu_B\\end{bmatrix}\\Bigg)\\\\\n",
    "p\\left(x_A\\right)&=\\frac{1}{Z'}\\exp\\Bigg(-\\bigg[\n",
    "&\\frac{1}{2}\\left(x_A-\\mu_A\\right)^TV_{AA}\\left(x_A-\\mu_A\\right)+\\frac{1}{2}\\left(x_A-\\mu_A\\right)^TV_{AB}\\left(x_B-\\mu_B\\right)\\\\\n",
    "&&+\\frac{1}{2}\\left(x_B-\\mu_B\\right)^TV_{BA}\\left(x_A-\\mu_A\\right)+\\frac{1}{2}\\left(x_B-\\mu_B\\right)^TV_{BB}\\left(x_B-\\mu_B\\right)\n",
    "\\bigg]\\Bigg)\n",
    "\\end{eqnarray}\n",
    "$$\n",
    "\n",
    "#### 3.3.3 使用配方法\n",
    "\n",
    "再上一个证明中，我们提到多变量情形中的配方法：\n",
    "\n",
    "$$\n",
    "\\frac{1}{2}z^TAz+b^Tz+c=\\frac{1}{2}\\left(z+A^{-1}b\\right)^TA\\left(z+A^{-1}b\\right)+c-\\frac{1}{2}b^TA^{-1}b\n",
    "$$\n",
    "\n",
    "其中$A$是对称非奇异矩阵，类似上一个证明，令：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "z&=x_B-\\mu_B\\\\\n",
    "A&=V_{BB}\\\\\n",
    "b&=V_{BA}\\left(x_A-\\mu_A\\right)\\\\\n",
    "c&=\\frac{1}{2}\\left(x_A-\\mu_A\\right)^TV_{AA}\\left(x_A-\\mu_A\\right)\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "则可以将$p\\left(x_B\\mid x_A\\right)$重写为：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "p\\left(x_B\\mid x_A\\right)=\\frac{1}{Z'}\\exp\\Bigg(-\\bigg[\n",
    "&\\frac{1}{2}\\left(x_B-\\mu_B+V_{BB}^{-1}V_{BA}\\left(x_A-\\mu_A\\right)\\right)^TV_{BB}\\left(x_B-\\mu_B+V_{BB}^{-1}V_{BA}\\left(x_A-\\mu_A\\right)\\right)\\\\\n",
    "&+\\frac{1}{2}\\left(x_A-\\mu_A\\right)^TV_{AA}\\left(x_A-\\mu_A\\right)-\\frac{1}{2}\\left(\\left(x_A-\\mu_A\\right)^TV_{AB}V_{BB}^{-1}V_{BA}\\left(x_A-\\mu_A\\right)\\right)\n",
    "\\bigg]\\Bigg)\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "将那些不依靠$x_B$的项提出来放在标准化常量中，得到：\n",
    "\n",
    "$$\n",
    "p\\left(x_B\\mid x_A\\right)=\\frac{1}{Z''}\\exp\\left(\\frac{1}{2}\\left[\\left(x_B-\\mu_B+V_{BB}^{-1}V_{BA}\\left(x_A-\\mu_A\\right)\\right)^TV_{BB}\\left(x_B-\\mu_B+V_{BB}^{-1}V_{BA}\\left(x_A-\\mu_A\\right)\\right)\\right]\\right)\n",
    "$$\n",
    "\n",
    "#### 3.3.4 证明得到的结果是一个高斯分布\n",
    "\n",
    "观察上一小节的最后，$p\\left(x_B\\mid x_A\\right)$形为期望是$\\mu_B+V_{BB}^{-1}V_{BA}\\left(x_A-\\mu_A\\right)$协方差是$V_{BB}^{-1}$的高斯分布，像上一个证明中做的那样：\n",
    "\n",
    "$$\n",
    "\\begin{align}\n",
    "\\begin{bmatrix}\n",
    "\\varSigma_{AA}&\\varSigma_{AB}\\\\\n",
    "\\varSigma_{BA}&\\varSigma_{BB}\n",
    "\\end{bmatrix}&=\n",
    "\\begin{bmatrix}\n",
    "V_{AA}&V_{AB}\\\\\n",
    "V_{BA}&V_{BB}\n",
    "\\end{bmatrix}^{-1}\\\\\n",
    "&=\\begin{bmatrix}\n",
    "\\left(V_{AA}-V_{AB}V_{BB}^{-1}V_{BA}\\right)^{-1}&-\\left(V_{AA}-V_{AB}V_{BB}^{-1}V_{BA}\\right)^{-1}V_{AB}V_{BB}^{-1}\\\\\n",
    "-V_{BB}^{-1}V_{BA}\\left(V_{AA}-V_{AB}V_{BB}^{-1}V_{BA}\\right)^{-1}&\\left(V_{BB}-V_{BA}V_{AA}^{-1}V_{AB}\\right)^{-1}\n",
    "\\end{bmatrix}\n",
    "\\end{align}\n",
    "$$\n",
    "\n",
    "于是可以得出：\n",
    "\n",
    "$$\n",
    "\\mu_{B\\mid A}=\\mu_B-V_{BB}^{-1}V_{BA}\\left(x_A-\\mu_A\\right)=\\mu_B+\\varSigma_{BA}\\varSigma_{AA}^{-1}\\left(x_A-\\mu_A\\right)\n",
    "$$\n",
    "\n",
    "反过来，我们也可以使用$\\varSigma$矩阵表达$V$矩阵：\n",
    "\n",
    "$$\n",
    "\\begin{bmatrix}\n",
    "V_{AA}&V_{AB}\\\\\n",
    "V_{BA}&V_{BB}\n",
    "\\end{bmatrix}=\n",
    "\\begin{bmatrix}\n",
    "\\left(\\varSigma_{AA}-\\varSigma_{AB}\\varSigma_{BB}^{-1}\\varSigma_{BA}\\right)^{-1}&\n",
    "-\\left(\\varSigma_{AA}-\\varSigma_{AB}\\varSigma_{BB}^{-1}\\varSigma_{BA}\\right)\\varSigma_{AB}\\varSigma_{BB}^{-1}\\\\\n",
    "-\\varSigma_{BB}^{-1}\\varSigma_{BA}\\left(\\varSigma_{AA}-\\varSigma_{AB}\\varSigma_{BB}^{-1}\\varSigma_{BA}\\right)&\n",
    "\\left(\\varSigma_{BB}-\\varSigma_{BA}\\varSigma_{AA}^{-1}\\varSigma_{AB}\\right)^{-1}\n",
    "\\end{bmatrix}\n",
    "$$\n",
    "\n",
    "于是得到：\n",
    "\n",
    "$$\n",
    "\\varSigma_{B\\mid A}=V_{BB}^{-1}=\\varSigma_{BB}-\\varSigma_{BA}\\varSigma_{AA}^{-1}\\varSigma_{AB}\n",
    "$$\n",
    "\n",
    "证毕。\n",
    "\n",
    "## 4. 总结\n",
    "\n",
    "在这篇笔记中，我们使用了多元高斯分布的几个简单特性（以及一些矩阵计算技巧）证明了多元高斯分布的几个闭包特性。多元高斯分布是一个非常有用的概率分布，部分因为这些闭包特性可以使得我们对多元高斯分布的大部分操作有解析解。从解析角度讲，在实际当中，涉及多元高斯分布的积分通常比较容易处理，因为我们可以利用已知的关于高斯分布的积分的结论，从而避免自己计算这些高斯分布的积分。\n",
    "\n",
    "## 5. 练习\n",
    "\n",
    "令$A\\in\\mathbb R^{n\\times n}$为对称非奇异矩阵，$b\\in\\mathbb R^n$，证明：\n",
    "\n",
    "$$\n",
    "\\int_{x\\in\\mathbb R^n}\\exp\\left(-\\frac{1}{2}x^TAx-x^Tb-c\\right)\\mathrm dx=\\frac{(2\\pi)^{n/2}}{\\lvert A\\rvert^{1/2}\\exp\\left(c-b^TA^{-1}b\\right)}\n",
    "$$\n",
    "\n",
    "（使用配方法易证。）"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
